Reconstruction of gene regulatory networks for Caenorhabditis elegans using tree-shaped gene expression data

Abstract Constructing gene regulatory networks is a widely adopted approach for investigating gene regulation, offering diverse applications in biology and medicine. A great deal of research focuses on using time series data or single-cell RNA-sequencing data to infer gene regulatory networks. However, such gene expression data lack either cellular or temporal information. Fortunately, the advent of time-lapse confocal laser microscopy enables biologists to obtain tree-shaped gene expression data of Caenorhabditis elegans, achieving both cellular and temporal resolution. Although such tree-shaped data provide abundant knowledge, they pose challenges like non-pairwise time series, laying the inaccuracy of downstream analysis. To address this issue, a comprehensive framework for data integration and a novel Bayesian approach based on Boolean network with time delay are proposed. The pre-screening process and Markov Chain Monte Carlo algorithm are applied to obtain the parameter estimates. Simulation studies show that our method outperforms existing Boolean network inference algorithms. Leveraging the proposed approach, gene regulatory networks for five subtrees are reconstructed based on the real tree-shaped datatsets of Caenorhabditis elegans, where some gene regulatory relationships confirmed in previous genetic studies are recovered. Also, heterogeneity of regulatory relationships in different cell lineage subtrees is detected. Furthermore, the exploration of potential gene regulatory relationships that bear importance in human diseases is undertaken. All source code is available at the GitHub repository https://github.com/edawu11/BBTD.git.


II Algorithm settings
To validate BBTD, four Boolean network methods are used to compare the performance with BBTD, including BB [1], REV [2], BFE [3] and ATEN [4].The model of BB is given as follows: where All unknown parameters are estimated by our proposed prescreening process and the Bayesian inference framework.Both REV and BFE are implemented in the R package BoolNet [5].Since the poor capability to handle nondeterministic network models, REV is not used in all datasets.ATEN is run using the R package ATEN.The default parameters are used for most of the software packages.For example, the parameter maxK of BFE is set to 5, and the number of iterations of ATEN is set to 5,000.

III Comparation of four Boolean network inference methods
In this study, BBTD, BB, BFE and ATEN are applied to the synthesized datatsets of C. elegans.Table S3 summarizes the information of calculation time and memory usage these four methods.

IV Trace plots of the log-posterior probability of five MCMC chains for five real subtree datasets
Figure S3 shows the log-posterior probability of five MCMC chains for each real subtree datasets.It is observed that the chains converged after about 10,000 iterations.

V Gene ontology enrichment analysis
In this study, gene ontology (GO) enrichment analysis is conducted using clusterProfiler v4.6.2 [6].The cutoffs for the p-value and q value are set at 0.05 and 0.20, respectively.The complete results of significant over-representation of functional categories related to each pair of regulator-regulated genes are presented in Table S4.The corresponding descriptions of GO ID within Table S4 are provided in Table S5.

Figure S1 :
Figure S1: Missing value plots for five real subtree datasets.The X-axis represents gene and the Y-axis represents cell.Gene and cell names are omitted in each sub-figure.'NA' indicates that the gene is missing in the cell, while 'Not NA' means that the gene has a time series of expression rates in the cell.The percentages in parentheses represent the ratio of 'NA' or 'Not NA'.(a) Missing value plot of the 'AB' subtree.(b) Missing value plot of the 'C' subtree.(c) Missing value plot of the 'D' subtree.(d) Missing value plot of the 'E' subtree.(e) Missing value plot of the 'MS' subtree.

Figure S2 :
Figure S2: Missing value plots for five real subtree datasets after identifying the candidate cells and genes.The X-axis represents candidate gene and the Y-axis represents candidate cell.Gene and cell names are omitted in each sub-figure.'NA' indicates that the gene is missing in the cell, while 'Not NA' means that the gene has a time series of expression rates in the cell.The percentages in parentheses represent the ratio of 'NA' or 'Not NA'.(a) Missing value plot of the 'AB' subtree.(b) Missing value plot of the 'C' subtree.(c) Missing value plot of the 'D' subtree.(d) Missing value plot of the 'E' subtree.(e) Missing value plot of the 'MS' subtree.

Figure S3 :
Figure S3: Trace plots of the log-posterior probability of five MCMC chains for five real subtree datasets.Each chain is run for 20,000 iterations.(a) Trace plots of the 'AB' subtree.(b) Trace plots of the 'C' subtree.(c) Trace plots of the 'D' subtree.(d) Trace plots of the 'E' subtree.(e) Trace plots of the 'MS' subtree.

Table S1 :
Candidate cells with cell fates for five real subtree datasets.

Table S2 :
Candidate genes used for gene regulatory network inference for each subtree.

Table S3 :
Average calculation time and memory usage for BBTD and three other Boolean network inference methods across five synthesized subtree datasets.The value in parentheses indicates the memory usage.

Table S5 :
Description of GO ID.

Table S6 :
Confirmed gene regulatory relationships among the candidate genes of each subtree retrieved from BioGRID Version 4.4.233 and WormBase Version WS254.